In August 2013 the Bay Area Bike Share system began operating in the San Francisco Bay Area of California. Customers go to a dock (bike station), unlock a bike through the app, and later return it to any of the stations located around the city. The system initially allocated half of its 700-bicycle fleet to San Francisco. In 2015 it was announced that the scheme would expand to 7,000 bikes over 2016–2017, and it became Ford GoBike through a partnership with Ford Motor Company. (Source: wikipedia.org/wiki)
In this project I focus on individual rides made in a bike-sharing system covering the greater San Francisco Bay Area in February 2019. Data: Fordgobike-Tripdata/201902
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
#read csv file
df = pd.read_csv('201902-fordgobike-tripdata.csv')
df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
#discover types of data and look for null value
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
#count null value
df.isnull().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
#summary statistic
df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year |
|---|---|---|---|---|---|---|---|---|---|
| count | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 175147.000000 |
| mean | 726.078435 | 138.590427 | 37.771223 | -122.352664 | 136.249123 | 37.771427 | -122.352250 | 4472.906375 | 1984.806437 |
| std | 1794.389780 | 111.778864 | 0.099581 | 0.117097 | 111.515131 | 0.099490 | 0.116673 | 1664.383394 | 10.116689 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 |
| 25% | 325.000000 | 47.000000 | 37.770083 | -122.412408 | 44.000000 | 37.770407 | -122.411726 | 3777.000000 | 1980.000000 |
| 50% | 514.000000 | 104.000000 | 37.780760 | -122.398285 | 100.000000 | 37.781010 | -122.398279 | 4958.000000 | 1987.000000 |
| 75% | 796.000000 | 239.000000 | 37.797280 | -122.286533 | 235.000000 | 37.797320 | -122.288045 | 5502.000000 | 1992.000000 |
| max | 85444.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 |
df.duplicated().sum()
0
df_new = df.copy()
#delete unrequired columns
df_new = df_new.drop(['start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude'], axis=1)
df_new.dropna(inplace=True)
#convert data type of(start_time, end_time) to datetime
df_new['start_time'] = pd.to_datetime(df_new['start_time'])
df_new['end_time'] = pd.to_datetime(df_new['end_time'])
#convert data type of (start_station_id , end_station_id, bike_id) to str
col_id = ['start_station_id' , 'end_station_id', 'bike_id']
for i in col_id:
    df_new[i] = df_new[i].astype(str)
#convert data type of (member_birth_year) to int
df_new['member_birth_year'] = df_new['member_birth_year'].astype(int)
#convert data type of (user_type, member_gender) to category
df_new[['user_type', 'member_gender']] = df_new[['user_type', 'member_gender']].astype('category')
df_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 12 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  object
 4   start_station_name       174952 non-null  object
 5   end_station_id           174952 non-null  object
 6   end_station_name         174952 non-null  object
 7   bike_id                  174952 non-null  object
 8   user_type                174952 non-null  category
 9   member_birth_year        174952 non-null  int32
 10  member_gender            174952 non-null  category
 11  bike_share_for_all_trip  174952 non-null  object
dtypes: category(2), datetime64[ns](2), int32(1), int64(1), object(6)
memory usage: 14.3+ MB
df_new.member_birth_year.describe()
count    174952.000000
mean       1984.803135
std          10.118731
min        1878.000000
25%        1980.000000
50%        1987.000000
75%        1992.000000
max        2001.000000
Name: member_birth_year, dtype: float64
#replace the implausible member_birth_year 1878 with 1978 (assumed here to be a data-entry error)
df_new['member_birth_year'] = df_new['member_birth_year'].replace([1878], 1978)
df_new.member_birth_year.describe()
count    174952.000000
mean       1984.803706
std          10.115522
min        1900.000000
25%        1980.000000
50%        1987.000000
75%        1992.000000
max        2001.000000
Name: member_birth_year, dtype: float64
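The summary above still shows a minimum of 1900, because `replace` only touches exact matches of 1878. A minimal sketch of how the remaining implausible values could be flagged instead of hard-coded, using toy birth years and a hypothetical cutoff (`MAX_PLAUSIBLE_AGE` is an assumption for illustration, not something used in this project):

```python
import pandas as pd

# Toy birth years (hypothetical values, not the real data).
birth_year = pd.Series([1878, 1900, 1984, 1992, 2001])

# Exact-match replacement, as in the cell above: only 1878 changes.
fixed = birth_year.replace([1878], 1978)

# Age in the year of the data, plus a hypothetical plausibility cutoff.
REFERENCE_YEAR = 2019
MAX_PLAUSIBLE_AGE = 100  # assumption for illustration only
age = REFERENCE_YEAR - fixed
implausible = age > MAX_PLAUSIBLE_AGE

print(fixed.tolist())
print(age[implausible].tolist())
```

Whether such rows should be dropped, corrected, or kept is a judgment call; the sketch only makes them visible.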
#create duration_min: minutes are easier to read than seconds
df_new['duration_min'] = df_new['duration_sec'] / 60
df_new.duration_min.describe()
count    174952.000000
mean         11.733379
std          27.370082
min           1.016667
25%           5.383333
50%           8.500000
75%          13.150000
max        1409.133333
Name: duration_min, dtype: float64
#create columns (day, hour) for start and end trip `https://pandas.pydata.org/docs/reference/series.html`
df_new['start_day'] = df_new['start_time'].dt.day_name()
df_new['end_day'] = df_new['end_time'].dt.day_name()
df_new['start_hour'] = df_new['start_time'].dt.hour
df_new['end_hour'] = df_new['end_time'].dt.hour
#create column for member's age
df_new['member_age'] = 2019 - df_new['member_birth_year']
df_new.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | end_station_id | end_station_name | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_day | end_day | start_hour | end_hour | member_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 13.0 | Commercial St at Montgomery St | 4902 | Customer | 1984 | Male | No | 869.750000 | Thursday | Friday | 17 | 8 | 35 |
| 2 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 3.0 | Powell St BART Station (Market St at 4th St) | 5905 | Customer | 1972 | Male | No | 1030.900000 | Thursday | Friday | 12 | 5 | 47 |
| 3 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 70.0 | Central Ave at Fell St | 6638 | Subscriber | 1989 | Other | No | 608.166667 | Thursday | Friday | 17 | 4 | 30 |
| 4 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 222.0 | 10th Ave at E 15th St | 4898 | Subscriber | 1974 | Male | Yes | 26.416667 | Thursday | Friday | 23 | 0 | 45 |
| 5 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 323.0 | Broadway at Kearny | 5200 | Subscriber | 1959 | Male | No | 29.883333 | Thursday | Friday | 23 | 0 | 60 |
df_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  object
 4   start_station_name       174952 non-null  object
 5   end_station_id           174952 non-null  object
 6   end_station_name         174952 non-null  object
 7   bike_id                  174952 non-null  object
 8   user_type                174952 non-null  category
 9   member_birth_year        174952 non-null  int32
 10  member_gender            174952 non-null  category
 11  bike_share_for_all_trip  174952 non-null  object
 12  duration_min             174952 non-null  float64
 13  start_day                174952 non-null  object
 14  end_day                  174952 non-null  object
 15  start_hour               174952 non-null  int64
 16  end_hour                 174952 non-null  int64
 17  member_age               174952 non-null  int32
dtypes: category(2), datetime64[ns](2), float64(1), int32(2), int64(3), object(8)
memory usage: 21.7+ MB
df_new.to_csv('fordgobike_201902_clean.csv', index=False)
The cleaned data contains 174,952 individual rides made in the bike-share system, with 18 features that fall into four groups:
[duration_sec, duration_min, start_time, end_time, start_day, end_day, start_hour, end_hour]
[start_station_id, start_station_name, end_station_id, end_station_name]
[bike_id, bike_share_for_all_trip]
[user_type, member_birth_year, member_gender, member_age]
[duration_min] is my main feature of interest. I expect that the day of the week and the hour of the day [start_day, start_hour] will affect trip duration. I also think that the user information [user_type, member_gender, member_age] will help identify the main target users.
#load clean data
fordbike_2019 = pd.read_csv('fordgobike_201902_clean.csv')
fordbike_2019.describe()
| | duration_sec | start_station_id | end_station_id | bike_id | member_birth_year | duration_min | start_hour | end_hour | member_age |
|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 704.002744 | 139.002126 | 136.604486 | 4482.587555 | 1984.803706 | 11.733379 | 13.456165 | 13.609533 | 34.196294 |
| std | 1642.204905 | 111.648819 | 111.335635 | 1659.195937 | 10.115522 | 27.370082 | 4.734282 | 4.748029 | 10.115522 |
| min | 61.000000 | 3.000000 | 3.000000 | 11.000000 | 1900.000000 | 1.016667 | 0.000000 | 0.000000 | 18.000000 |
| 25% | 323.000000 | 47.000000 | 44.000000 | 3799.000000 | 1980.000000 | 5.383333 | 9.000000 | 9.000000 | 27.000000 |
| 50% | 510.000000 | 104.000000 | 101.000000 | 4960.000000 | 1987.000000 | 8.500000 | 14.000000 | 14.000000 | 32.000000 |
| 75% | 789.000000 | 239.000000 | 238.000000 | 5505.000000 | 1992.000000 | 13.150000 | 17.000000 | 18.000000 | 39.000000 |
| max | 84548.000000 | 398.000000 | 398.000000 | 6645.000000 | 2001.000000 | 1409.133333 | 23.000000 | 23.000000 | 119.000000 |
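Note that `start_station_id`, `end_station_id` and `bike_id` appear in the numeric summary above: the string, datetime and category types set during cleaning are not stored in a CSV file, so they are inferred again on load. A minimal sketch of how the types could be restored when reading, using a toy in-memory CSV (the rows are hypothetical; `parse_dates` and `dtype` are standard `pd.read_csv` parameters):

```python
import io
import pandas as pd

# Toy CSV text standing in for the cleaned file (hypothetical rows).
csv_text = (
    "start_time,start_station_id,bike_id,user_type\n"
    "2019-02-28 17:32:10.145,21.0,4902,Customer\n"
    "2019-02-28 23:54:18.549,7.0,4898,Subscriber\n"
)

# Default read: IDs come back numeric and timestamps as plain strings.
plain = pd.read_csv(io.StringIO(csv_text))

# Read with explicit types so the cleaning choices survive the round trip.
typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["start_time"],
    dtype={"start_station_id": str, "bike_id": str, "user_type": "category"},
)

print(plain.dtypes)
print(typed.dtypes)
```

The analysis below does not depend on these types, so this is optional; it mainly keeps ID columns out of numeric summaries.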
#explore distribution of trip duration in minutes using plotly.express with logscale for x (reference 'https://plotly.com/python-api-reference/plotly.express.html')
px.histogram(data_frame=fordbike_2019, x= 'duration_min', hover_name='duration_min', log_x=True)
We can observe that trip duration has a long-tailed, right-skewed distribution: few trips last more than 30 minutes, and more than 90% of trips are shorter than 1 hour.
#let's use histplot and limit the x-axis to 30 minutes to make the plot easier to interpret
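A claim such as "more than 90% of trips are shorter than 1 hour" can be checked directly, since the mean of a boolean mask is the share of rows that satisfy it. A minimal sketch on toy durations (the values are hypothetical; on the real data the same expression would be applied to `fordbike_2019.duration_min`):

```python
import pandas as pd

# Toy trip durations in minutes (hypothetical values).
duration_min = pd.Series([3.0, 5.5, 8.5, 12.0, 14.0, 22.0, 45.0, 75.0, 9.0, 6.0])

# The mean of a boolean Series is the fraction of True values.
share_under_30 = (duration_min < 30).mean()
share_under_60 = (duration_min < 60).mean()

print(f"under 30 min: {share_under_30:.0%}, under 60 min: {share_under_60:.0%}")
```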
sb.set_style('darkgrid')
plt.figure(figsize=[12,8])
bins = np.arange(0, 30, 1)
ticks = np.arange(0, 32, 2)
labels = ['{}'.format(v) for v in ticks]
sb.histplot(data= fordbike_2019, x= 'duration_min', bins = bins, shrink=0.8)
plt.xticks(ticks, labels)
plt.title('Distribution of Trip Duration in Minutes', fontsize=15, fontweight='bold')
plt.xlabel('Trip Duration in Minutes', fontsize=11, fontweight='bold')
plt.ylabel('Number of Trips',fontsize=11, fontweight='bold')
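#mark the approximate range that holds most trips (about 4.7 to 14.7 minutes, given a mean of about 11.7)
plt.axvline(fordbike_2019.duration_min.mean()+3, color='r')
plt.axvline(fordbike_2019.duration_min.mean()-7, color='r');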
We can easily see from the plot above that most users take short trips. Most trips last between 4 and 15 minutes.
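Because the distribution is right-skewed, the mean sits above the typical trip, so quantiles may describe "most trips" more robustly than offsets from the mean. A minimal sketch on toy durations with one very long trip (the values are hypothetical):

```python
import pandas as pd

# Toy durations in minutes with a single extreme trip (hypothetical values).
duration_min = pd.Series([4.0, 6.0, 8.0, 9.0, 10.0, 12.0, 15.0, 600.0])

mean = duration_min.mean()
median = duration_min.median()

# The interquartile range bounds the middle 50% of trips.
q25, q75 = duration_min.quantile([0.25, 0.75])

print(mean, median, q25, q75)
```

In the real data the quartiles reported earlier (about 5.4 and 13.2 minutes) are close to the 4 to 15 minute range read off the histogram.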
#plot number of trips starting each day using countplot
sb.set_style('darkgrid')
base_color = sb.color_palette()[0]
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
plt.figure(figsize=(12,8))
sb.countplot(data= fordbike_2019, y = 'start_day', color=base_color, order=day_ordered)
plt.title('Number of Trips Starting Each Day', fontweight='bold', fontsize=14)
plt.ylabel('WeekDays', fontweight='bold', fontsize=12)
plt.xlabel('Count Trips', fontweight='bold', fontsize=12);
From the figure above, we can see that the most trips were taken on Thursday and the fewest on Saturday and Sunday.
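The numbers behind the bar chart can also be read as a table. A minimal sketch, on toy data, of counting trips per day and putting them in calendar order (the day values are hypothetical; `fill_value=0` covers days that do not occur in the toy sample):

```python
import pandas as pd

day_ordered = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
               'Friday', 'Saturday', 'Sunday']

# Toy start days (hypothetical values).
start_day = pd.Series(['Thursday', 'Monday', 'Thursday', 'Saturday',
                       'Tuesday', 'Thursday', 'Monday'])

# value_counts sorts by frequency; reindex restores calendar order.
counts = start_day.value_counts().reindex(day_ordered, fill_value=0)

print(counts)
print('busiest day:', counts.idxmax())
```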
#plot number of trips starting each hour using countplot
sb.set_style('darkgrid')
base_color = sb.color_palette()[0]
hour_ordered = fordbike_2019.start_hour.value_counts().index
plt.figure(figsize=(12,8))
sb.countplot(data= fordbike_2019, y = 'start_hour', color=base_color, order=hour_ordered)
plt.title('Number of Trips Starting Each Hour', fontweight='bold', fontsize=14)
plt.ylabel('Hours of The Day', fontweight='bold', fontsize=12)
plt.xlabel('Count Trips', fontweight='bold', fontsize=12);
We can see from the figure above that the most trips start at 5 PM and 8 AM, while the number of trips drops off after midnight.
#I will move on to the other features in the data : `user_type` , `member_gender`
fig, ax = plt.subplots(nrows=2, figsize=(10,6))
g = sb.countplot(data=fordbike_2019, x='user_type', color=base_color, ax= ax[0])
ax[0].set_xlabel('User Type', fontweight='bold', fontsize=12)
g.set_title('Number of Trips by User Type and by Member Gender', fontsize=16, fontweight='bold')
sb.countplot(data=fordbike_2019, x='member_gender', color=base_color, ax= ax[1])
ax[1].set_xlabel('Gender of Members', fontweight='bold', fontsize=12);
We noticed the following from the previous figures:
#plot member age using histplot with kernel density estimation (kde) and a log scale
sb.set_style('darkgrid')
plt.figure(figsize=[12,8])
bins = 10 ** np.arange(1.25, np.log10(fordbike_2019.member_age.max())+0.1, 0.1)
ticks = [15, 20, 25, 30, 35, 40, 45, 50, 60, 65, 70, 75, 80, 85, 90, 95, 100]
labels = ['{}'.format(v) for v in ticks]
sb.histplot(data= fordbike_2019, x= 'member_age', bins=bins, kde=True)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.title('Distribution of Members Age', fontsize=15, fontweight='bold')
plt.xlabel('Member Age', fontsize=11, fontweight='bold')
plt.ylabel('Number of Trips',fontsize=11, fontweight='bold');
According to the previous figure, most members are around 25 to 45 years old.
I observed that trip duration in minutes has a long-tailed, right-skewed distribution: few trips last more than 30 minutes, and more than 90% of trips are shorter than 1 hour. I therefore applied a log-scale transformation, and then restricted the plot to the first 30 minutes. After that it is easy to see that most users take short trips, with most trips lasting between 4 and 15 minutes.
From the other plots, I observed that:
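The bin expression used for the age histogram builds edges that are evenly spaced in log10 units, so each bin is wider than the previous one by a constant factor and all bins look equally wide on a log axis. A minimal sketch of the edges it produces, taking the maximum age of 119 from the summary table:

```python
import numpy as np

max_age = 119  # maximum member_age in the summary table

# Exponents step by 0.1 from 1.25 up to just past log10(max_age).
bins = 10 ** np.arange(1.25, np.log10(max_age) + 0.1, 0.1)

# Each edge is a constant multiple (10 ** 0.1, about 1.26) of the one before.
ratios = bins[1:] / bins[:-1]

print(np.round(bins, 1))
print(np.round(ratios, 3))
```

The first edge is 10 ** 1.25, just under 18, which matches the minimum age of 18 in the data.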
#plot start_day and duration_min using pointplot
plt.figure(figsize=(12,8))
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
sb.pointplot(data=fordbike_2019, x='start_day', y='duration_min',
color=base_color, order = day_ordered)
plt.xlabel('WeekDays', fontsize=16, fontweight='bold')
plt.ylabel('Trip Duration in Minutes', fontsize=12, fontweight='bold')
plt.title('Avg Trip Duration in Minutes Each Day', fontsize=16, fontweight='bold');
From the figure above, we observe that trips on weekdays are short on average, while trips on the weekend (Saturday and Sunday) are longer.
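The point plot shows the mean duration per day, which can also be computed with a `groupby`. A minimal sketch on toy trips, including a weekday versus weekend comparison (the rows and the `is_weekend` helper column are hypothetical, introduced only for illustration):

```python
import pandas as pd

# Toy trips (hypothetical values).
trips = pd.DataFrame({
    'start_day': ['Monday', 'Monday', 'Saturday', 'Saturday', 'Sunday'],
    'duration_min': [8.0, 10.0, 15.0, 25.0, 18.0],
})

# Mean duration per day, in a chosen order.
mean_by_day = (trips.groupby('start_day')['duration_min']
                    .mean()
                    .reindex(['Monday', 'Saturday', 'Sunday']))

# Hypothetical helper flag to compare weekdays with weekends.
trips['is_weekend'] = trips['start_day'].isin(['Saturday', 'Sunday'])
mean_by_weekend = trips.groupby('is_weekend')['duration_min'].mean()

print(mean_by_day)
print(mean_by_weekend)
```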
plt.figure(figsize=(12,8))
sb.barplot(data=fordbike_2019, x='start_hour', y='duration_min', color=base_color)
plt.xlabel('Hour', fontweight='bold', fontsize=12)
plt.ylabel('Trip Duration in Minutes', fontsize=12, fontweight='bold')
plt.title('Avg Trip Duration in Minutes Each Hour', fontsize=16, fontweight='bold');
I observed that average trip duration is short for most hours of the day, except at 2 AM and 3 AM, when trips are longer.
#keep only trips with duration_min less than or equal to 60 minutes
duration_60 = fordbike_2019.query('duration_min<=60')
#user_type, member_gender against duration_min using violinplot
fig, ax = plt.subplots(ncols=2, figsize=(10,8))
sb.violinplot(data=duration_60, x='user_type', y='duration_min', ax=ax[0], color=base_color, inner='quartile')
ax[0].set_xlabel('Type of User', fontsize=12, fontweight='bold')
ax[0].set_ylabel('Trip duration in Minutes', fontsize=12, fontweight='bold')
sb.violinplot(data=duration_60, x='member_gender', y='duration_min', ax=ax[1], color=base_color, inner='quartile')
ax[1].set_xlabel('Gender of Members', fontsize=12, fontweight='bold')
ax[1].set_ylabel('Trip duration in Minutes', fontsize=12, fontweight='bold');
We observed:
#member_age against duration_min using scatterplot with log scales on both axes
plt.figure(figsize=(10,5))
ticks = [15, 20, 25, 30, 35, 40, 45, 50, 60, 65, 70, 75, 80, 85, 90, 95, 100]
labels = ['{}'.format(v) for v in ticks]
sb.scatterplot(data=fordbike_2019, x='member_age', y='duration_min', alpha=1/20)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.xlabel('Member Age', fontsize=11, fontweight='bold')
plt.ylabel('Trip duration in Minutes', fontsize=12, fontweight='bold')
plt.yscale('log');
We observed a negative relationship between age and trip duration, as expected: as age increases, trip duration tends to decrease.
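The direction of a relationship like this can be backed with a number. A minimal sketch of a rank (Spearman-style) correlation, computed here as the Pearson correlation of the ranks so that only pandas is needed; the toy values are hypothetical and say nothing about the strength of the relationship in the real data:

```python
import pandas as pd

# Toy ages and durations (hypothetical values).
toy = pd.DataFrame({
    'member_age': [20, 25, 30, 40, 50, 60],
    'duration_min': [30.0, 22.0, 25.0, 12.0, 10.0, 6.0],
})

# Rank correlation: Pearson correlation applied to the ranks.
rank_corr = toy['member_age'].rank().corr(toy['duration_min'].rank())

# A negative value means duration tends to fall as age rises.
print(round(rank_corr, 3))
```

A rank correlation is used because both variables are skewed and the plot uses log scales; it depends only on ordering, not on the scale.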
g = sb.FacetGrid(data=fordbike_2019,col='user_type', col_wrap = 2, height = 4, aspect=1.5, sharey=False, margin_titles=True)
g.map_dataframe(sb.countplot, 'start_hour', order=fordbike_2019.start_hour.value_counts().index)
g.set_titles(col_template='{col_name}')
for i in range(2):
    g.axes[i].set_xlabel('Hours of The Day', fontsize=12, fontweight='bold')
    g.axes[i].set_ylabel('Number of Trips');
Subscriber usage clearly peaks at typical rush hours, when people go to work in the morning and leave work in the afternoon, while customers tend to ride most in the afternoon or early evening. The number of trips by subscribers is about twice the number of trips by customers during the day.
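Because the two panels use separate y-axes (`sharey=False`), their bar heights are not directly comparable. One way to compare the shape of the hourly pattern is the share of each user type's trips that start in each hour. A minimal sketch with `pd.crosstab` on toy data (the rows are hypothetical):

```python
import pandas as pd

# Toy trips (hypothetical values).
toy = pd.DataFrame({
    'user_type': ['Subscriber'] * 4 + ['Customer'] * 4,
    'start_hour': [8, 8, 17, 12, 14, 15, 15, 8],
})

# normalize='columns' makes each user type's column sum to 1.
share = pd.crosstab(toy['start_hour'], toy['user_type'], normalize='columns')

print(share)
print('peak hour per user type:')
print(share.idxmax())
```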
g = sb.FacetGrid(data=fordbike_2019,col='user_type', col_wrap = 2, height = 4, aspect=1.5, sharey=False, margin_titles=True)
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
g.map_dataframe(sb.countplot, 'start_day', order = day_ordered)
g.set_titles(col_template='{col_name}')
for i in range(2):
    g.axes[i].set_xlabel('WeekDays', fontsize=12, fontweight='bold')
    g.axes[i].set_ylabel('Number of Trips');
As before, there is much more subscriber usage than casual customer usage overall. Subscribers take more trips on weekdays than on weekends (Saturday, Sunday), while customers take trips on every day of the week, including weekends.
plt.figure(figsize=(10,6))
sb.countplot(data=fordbike_2019, x='user_type', hue='member_gender')
plt.xlabel('Type of User', fontweight='bold', fontsize=12)
plt.ylabel('Count', fontweight='bold', fontsize=12);
There are more male than female subscribers, and also more male than female customers.
I made some observations about the feature of interest in relation to other features:
I made some observations about relationships between the other features:
plt.figure(figsize=(12,8))
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
sb.pointplot(data=fordbike_2019, x='start_day', y='duration_min',
color=base_color, order = day_ordered,
hue='user_type')
plt.xlabel('WeekDays', fontsize=16, fontweight='bold')
plt.ylabel('Trip Duration in Minutes', fontsize=12, fontweight='bold')
plt.title('Avg Trip Duration in Minutes Each Day Depending on Type of User', fontsize=16, fontweight='bold');
From the chart above, subscribers take much shorter rides than customers on every day of the week. Both user types show a clear increase in trip duration on Saturdays and Sundays, especially customers. Subscriber usage appears to be more efficient than customer usage in general, with a very constant average duration from Monday to Friday.
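The values plotted in the chart are group means by day and user type, which can be tabulated with `pd.pivot_table`. A minimal sketch on toy trips (the rows are hypothetical):

```python
import pandas as pd

# Toy trips (hypothetical values).
toy = pd.DataFrame({
    'start_day': ['Monday', 'Monday', 'Monday',
                  'Saturday', 'Saturday', 'Saturday'],
    'user_type': ['Subscriber', 'Subscriber', 'Customer',
                  'Subscriber', 'Customer', 'Customer'],
    'duration_min': [8.0, 10.0, 20.0, 12.0, 30.0, 40.0],
})

# One row per day, one column per user type, mean duration in each cell.
mean_table = pd.pivot_table(toy, index='start_day', columns='user_type',
                            values='duration_min', aggfunc='mean')

print(mean_table)
```

A table like this makes it easy to read off how much longer customer trips are on a given day, which is harder to judge from the plot alone.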
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
trip_sub = fordbike_2019.query('user_type == "Subscriber"').groupby(['start_day', 'start_hour']).size()
trip_sub = trip_sub.reset_index(name='count').pivot(index='start_day', columns='start_hour', values='count')
trip_sub = trip_sub.reindex(day_ordered)
trip_cus = fordbike_2019.query('user_type == "Customer"').groupby(['start_day', 'start_hour']).size()
trip_cus = trip_cus.reset_index(name='count').pivot(index= 'start_day', columns='start_hour', values='count')
trip_cus = trip_cus.reindex(day_ordered)
plt.figure(figsize=(15,10))
plt.suptitle('Hourly Usage during Weekdays for Customers and Subscribers', fontsize=16, fontweight='bold')
plt.subplot(2,2,1)
sb.heatmap(trip_sub, cmap='rocket_r')
plt.xlabel('Hours of The Day', fontsize=12, fontweight='bold')
plt.title('Subscriber', fontsize=13, fontweight='bold')
plt.ylabel('')
plt.subplot(2,2,2)
sb.heatmap(trip_cus, cmap='rocket_r')
plt.xlabel('Hours of The Day', fontsize=12, fontweight='bold')
plt.title('Customer', fontsize=13, fontweight='bold')
plt.ylabel('');
The previous heatmaps clearly show very different usage patterns for the two types of users:
day_ordered = ['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
g = sb.FacetGrid(data=fordbike_2019,col='user_type', height = 4, aspect=1.5, sharey=False, margin_titles=True)
g.map_dataframe(sb.countplot, 'start_day', order=day_ordered, hue='member_gender')
g.set_titles(col_template='{col_name}')
g.add_legend();
I observed from the previous plot:
Multivariate exploration reinforced some of the patterns detected in the earlier univariate and bivariate exploration.
The relationship between hour of day, day of week, and number of trips depends on the type of user:
Subscribers use the system heavily on working days, i.e. from Monday to Friday. Customers use the bike share a lot on weekends (Saturday, Sunday), especially in the afternoon. For subscribers, most trips are concentrated around 7-9 a.m. and 4-6 p.m.
For both customers and subscribers, male users take more trips than female users.
I think there are no big surprises here. All the interactions between the features complement each other and make sense.